Abstract

We consider multi-armed bandit (MAB) problems in which a player aims to accrue reward by
sequentially playing a given set of arms with unknown reward statistics. In the classic work,
policies were proposed to achieve the optimal logarithmic regret order for some special classes
of light-tailed reward distributions, e.g., Auer et al.'s UCB1 index policy for reward
distributions with finite support. In this paper, we extend Auer et al.'s UCB1 index policy to
achieve the optimal logarithmic regret order for all light-tailed (or equivalently, locally
sub-Gaussian) reward distributions defined by the (local) existence of the moment-generating
function.

I. Introduction 

In the classic MAB, there are $N$ independent arms offering random rewards to a player. At
each time, the player chooses one arm to play and obtains a reward drawn i.i.d. over time from
a distribution with unknown mean. Different arms may have different reward distributions. The
design objective is a sequential arm selection policy that maximizes the total expected reward
over a long but finite horizon $T$.

Each received reward plays two roles: increasing the wealth of the player, and providing one
more observation for learning the reward statistics of the arm. The tradeoff is thus between
exploration and exploitation: which should be emphasized in arm selection, an arm less
explored and thus holding potential for the future, or an arm with a good history of rewards? In
1952, Robbins addressed the two-armed bandit problem [1]. He showed that the same maximum
average reward achievable under a known model can be obtained by dedicating two arbitrary
sublinear sequences for playing each of the two arms. In 1985, Lai and Robbins proposed
a finer performance measure, the so-called regret, defined as the expected total reward loss
with respect to the ideal scenario of known reward models (under which the arm with the
largest reward mean is always played) [2]. Regret not only indicates whether the maximum
average reward under known models is achieved, but also measures the convergence rate of the
average reward, or the effectiveness of learning. Lai and Robbins showed that the minimum regret
has a logarithmic order in $T$. They also constructed explicit policies to achieve this minimum
regret for Gaussian, Bernoulli, Poisson and Laplacian distributions assuming the knowledge of
the distribution type [2].¹ In 1995, Agrawal developed index-type policies that achieve $O(\log T)$
regret for Gaussian, Bernoulli, Poisson, Laplacian, and exponential distributions [3]. In 2002,
Auer et al. proposed a simpler index policy, referred to as UCB1, with $O(\log T)$ regret for
reward distributions with finite support [4]. The UCB1 policy does not require the knowledge of
the distribution type; it only requires an upper bound on the finite support.

These classic policies focus on finite-support reward distributions and several specific
infinite-support light-tailed distributions. In this paper, we generalize Auer et al.'s index to
achieve $O(\log T)$ regret for all light-tailed reward distributions. Light-tailed distributions,
also referred to as locally sub-Gaussian distributions, are defined by the (local) existence of
the moment-generating function. This work thus provides a simple index policy that achieves the
optimal regret order for a broader class of reward distributions.

MAB with general and unknown reward distributions was also considered in our prior work [5],
where a Deterministic Sequencing of Exploration and Exploitation (DSEE) approach was
proposed to achieve the logarithmic regret order for all light-tailed reward distributions. DSEE
also achieves sublinear regret orders for heavy-tailed reward distributions. Specifically, for any
$p > 1$, $O(T^{1/p})$ regret can be achieved by DSEE when the moments of the reward distributions
exist (only) up to the $p$th order. The advantage of DSEE is its simple deterministic structure that
can handle variations of MAB including general objectives, decentralized MAB with partial
reward observations, and rested/restless Markovian reward models [6]. However, compared to
the extended UCB1 policy that adaptively adjusts the number of plays on each arm based on
observations, DSEE spends an equal amount of time on each arm during the exploration phase for
learning the reward statistics. Simulation results indicate that the extended UCB1 policy can
have a better leading constant in the logarithmic regret order.

¹For the existence of an optimal policy in general, Lai and Robbins established a sufficient
condition on the reward distributions. However, the condition is difficult to check and is only
verified for the specific distributions mentioned above.

Other work on extensions of the UCB1 policy includes [7]-[11]. In [7]-[9], UCB1 was
extended to handle decentralized MAB with multiple distributed players. In [10], [11], UCB1
was extended to the rested and restless Markovian reward models, respectively.

II. The Classic MAB 

In this section, we present the non-Bayesian formulation of the classic MAB and Auer et al.'s
UCB1 policy.

A. Problem Formulation 

Consider an $N$-arm bandit and a single player. At each time $t$, the player chooses one arm
to play. Playing arm $n$ yields i.i.d. random reward $X_n(t)$ drawn from an unknown distribution
$f_n(s)$. Let $\mathcal{F} = (f_1(s), \cdots, f_N(s))$ denote the set of the unknown distributions.
We assume that the reward mean $\theta_n = \mathbb{E}[X_n(t)]$ exists for all $1 \le n \le N$.

An arm selection policy $\pi$ is a function that maps from the player's observation and decision
history to the arm to play. Let $\sigma$ be a permutation of $\{1, \cdots, N\}$ such that
$\theta_{\sigma(1)} \ge \theta_{\sigma(2)} \ge \cdots \ge \theta_{\sigma(N)}$. The system performance
under policy $\pi$ is measured by the regret $R_T^{\pi}(\mathcal{F})$ defined as

$$ R_T^{\pi}(\mathcal{F}) = T\theta_{\sigma(1)} - \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} X_{\pi}(t)\Big], $$

where $X_{\pi}(t)$ is the random reward obtained at time $t$ under policy $\pi$, and
$\mathbb{E}_{\pi}[\cdot]$ denotes the expectation with respect to policy $\pi$. The objective is to
minimize the rate at which $R_T^{\pi}(\mathcal{F})$ grows with $T$ under any distribution set
$\mathcal{F}$ by choosing an optimal policy $\pi^*$. Although all policies with sublinear regret
achieve the maximum average reward, the difference in their total expected reward can be
arbitrarily large as $T$ increases. The minimization of the regret is thus of great interest.
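
To make the regret definition concrete, the following minimal sketch evaluates it for a policy whose expected number of plays on each arm is known. It assumes the standard identity that the expected total reward equals the sum over arms of the reward mean times the expected number of plays on that arm. The function name `regret_from_plays` and the example numbers are hypothetical and not part of the original formulation.

```python
def regret_from_plays(means, expected_plays):
    """Regret T * max(means) - sum_n means[n] * expected_plays[n].

    means[n] is the reward mean of arm n; expected_plays[n] is the expected
    number of times the policy plays arm n over the horizon T (illustrative
    inputs, assumed known here only for the sake of the example).
    """
    horizon = sum(expected_plays)          # T: every time slot plays one arm
    best_mean = max(means)                 # mean of the best arm
    expected_reward = sum(m * c for m, c in zip(means, expected_plays))
    return horizon * best_mean - expected_reward


# Two arms with means 0.9 and 0.5; the policy plays the worse arm 10 times
# out of T = 100, so the regret is 10 * (0.9 - 0.5).
example_regret = regret_from_plays([0.9, 0.5], [90, 10])
```

Written this way, the regret is a sum over suboptimal arms of the mean gap times the expected number of plays, so bounding the regret reduces to bounding how often each suboptimal arm is played.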

B. UCB1 Policy

In Auer et al.'s UCB1 policy [4], an index $I(t)$ is computed for each arm and the arm with
the largest index is chosen. Assume that the support of the reward distributions is normalized
to $[0, 1]$. The index (referred to as the upper confidence bound) has the following simple form:

$$ I(t) = \bar{\theta}(t) + \sqrt{\frac{2\log t}{\tau(t)}}, \quad (1) $$

where $\bar{\theta}(t)$ is the sample mean of the arm and $\tau(t)$ the number of times the arm has
been played up to time $t$.

This index form is intuitive in light of Lai and Robbins's result on the logarithmic order of
the minimum regret, which indicates that each arm needs to be explored on the order of $\log t$
times. For an arm sampled at a smaller order than $\log t$, its index, dominated by the second
term, will be sufficiently large for large $t$ to ensure further exploration.
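
As a minimal sketch of how such an index could be evaluated, assume the usual form of Auer et al.'s index: the sample mean of an arm plus $\sqrt{2\log t/\tau}$, where $\tau$ is the number of plays on that arm so far. The function and variable names below are illustrative.

```python
import math


def ucb1_index(sample_mean, num_plays, t):
    """Sample mean plus the exploration bonus sqrt(2 * log(t) / num_plays)."""
    return sample_mean + math.sqrt(2.0 * math.log(t) / num_plays)


def select_arm(sample_means, plays, t):
    """Return the position of the arm with the largest index at time t."""
    indices = [ucb1_index(m, k, t) for m, k in zip(sample_means, plays)]
    return max(range(len(indices)), key=indices.__getitem__)


# An arm with a lower sample mean can still be selected when it has been
# played far less often, because its exploration bonus dominates.
chosen = select_arm([0.6, 0.5], [500, 5], t=505)
```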

Based on the Chernoff-Hoeffding bound on the convergence of the sample mean for distributions
with finite support [12], Auer et al. established a regret growing at the logarithmic order
with time. Furthermore, an upper bound on the regret accumulated up to any finite time was
also established.

III. Extension of UCB1 Policy

In this section, we generalize UCB1 for the class of light-tailed reward distributions.

A. Light-Tailed Reward Distributions 

The class of light-tailed reward distributions is defined by the (local) existence of the
moment-generating function. Such reward distributions are also referred to as locally
sub-Gaussian distributions (see [13]).

Definition 1: The moment-generating function $M(u) = \mathbb{E}[\exp(uX)]$ of a random variable $X$
exists if there exists a $u_0 > 0$ such that

$$ M(u) < \infty \quad \forall\, |u| \le u_0. \quad (2) $$

By the mean-value theorem, the function $M(u)$ is infinitely differentiable. A direct application
of Taylor's theorem leads to the following upper bound on $M(u)$ (see Theorem 1 in [13]). Without
loss of generality, assume that $\mathbb{E}[X] = 0$. We have

$$ M(u) \le \exp(\zeta u^2/2), \quad \forall\, |u| \le u_0, \quad \zeta \ge \sup\{M^{(2)}(u),\ -u_0 \le u \le u_0\}, \quad (3) $$

where $M^{(2)}(\cdot)$ denotes the second derivative of $M(\cdot)$ and $u_0$ the parameter specified
in (2). We observe that the upper bound in (3) is the moment-generating function of a zero-mean
Gaussian random variable with variance $\zeta$. The distributions satisfying (3) (i.e., with a
finite moment-generating function around 0) are thus called locally sub-Gaussian distributions.
If there is no constraint on the parameter $u$ in (3), the corresponding distributions are
referred to as sub-Gaussian.
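
As a numerical illustration of this bound, consider a zero-mean Laplace distribution with scale $b$, whose moment-generating function $1/(1 - b^2u^2)$ is finite only for $|u| < 1/b$, so it is light-tailed but not sub-Gaussian. The sketch below takes $\zeta$ as the supremum of the second derivative over $[-u_0, u_0]$ and checks the inequality $M(u) \le \exp(\zeta u^2/2)$ on a grid. The choice $b = 1$, $u_0 = 0.5$ and all names are assumptions made for the example.

```python
import math


def laplace_mgf(u, b=1.0):
    """Moment-generating function of a zero-mean Laplace(b), valid for |u| < 1/b."""
    return 1.0 / (1.0 - (b * u) ** 2)


def laplace_mgf_second_derivative(u, b=1.0):
    """Second derivative of laplace_mgf with respect to u."""
    d = 1.0 - (b * u) ** 2
    return 2.0 * b**2 / d**2 + 8.0 * b**4 * u**2 / d**3


b, u0 = 1.0, 0.5                       # illustrative choice with u0 < 1/b
# The second derivative is increasing in |u|, so its supremum over [-u0, u0]
# is attained at the endpoint u0.
zeta = laplace_mgf_second_derivative(u0, b)

# Check M(u) <= exp(zeta * u^2 / 2) on an evenly spaced grid over [-u0, u0].
grid = [-u0 + i * (2 * u0) / 200 for i in range(201)]
bound_holds = all(
    laplace_mgf(u, b) <= math.exp(zeta * u**2 / 2.0) for u in grid
)
```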

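Based on (3), we show a Bernstein-type bound on the convergence rate of the sample mean
as given in the lemma below.

Lemma 1: Consider i.i.d. light-tailed random variables $\{X(t)\}_{t=1}^{\infty}$ with a finite
moment-generating function over range $[-u_0, u_0]$. Let $\bar{X}_t = \big(\sum_{k=1}^{t} X(k)\big)/t$
and $\theta = \mathbb{E}[X(1)]$. We have, for all $\epsilon > 0$,

$$ \Pr\{\bar{X}_t - \theta \ge \epsilon\} \le \begin{cases} \exp\big(-\frac{t}{2\zeta}\epsilon^2\big), & \epsilon < \zeta u_0 \\ \exp\big(-\frac{t u_0}{2}\epsilon\big), & \epsilon \ge \zeta u_0 \end{cases} \quad (4) $$

where $\zeta$, $u_0$ are the parameters specified in (3). The same bound holds for
$\Pr\{\bar{X}_t - \theta \le -\epsilon\}$ by symmetry.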

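Proof: The proof follows a line of argument similar to that given in [14]. We provide it below
for completeness. By Markov's inequality, $\forall\, u \in [0, u_0]$,

$$
\begin{aligned}
\Pr\{\bar{X}_t - \theta \ge \epsilon\} &= \Pr\{ut(\bar{X}_t - \theta) \ge ut\epsilon\} \quad (5) \\
&\le \frac{\mathbb{E}[\exp(ut(\bar{X}_t - \theta))]}{\exp(ut\epsilon)} \\
&= \frac{\mathbb{E}[\exp(\sum_{k=1}^{t} u(X(k) - \theta))]}{\exp(ut\epsilon)} \\
&= \frac{\mathbb{E}[\prod_{k=1}^{t} \exp(u(X(k) - \theta))]}{\exp(ut\epsilon)} \\
&= \frac{\prod_{k=1}^{t} \mathbb{E}[\exp(u(X(k) - \theta))]}{\exp(ut\epsilon)} \\
&\le \exp(t\zeta u^2/2 - ut\epsilon) \\
&\le \begin{cases} \exp\big(-\frac{t}{2\zeta}\epsilon^2\big), & \epsilon < \zeta u_0 \\ \exp\big(\frac{\zeta}{2}tu_0^2 - tu_0\epsilon\big), & \epsilon \ge \zeta u_0 \end{cases} \quad (6)
\end{aligned}
$$

where the last step takes $u = \epsilon/\zeta$ when $\epsilon < \zeta u_0$ and $u = u_0$ otherwise.
It is not difficult to show that if $\epsilon \ge \zeta u_0$, then

$$ \frac{\zeta}{2}tu_0^2 - tu_0\epsilon \le -\frac{tu_0}{2}\epsilon. \quad (7) $$

Based on (6) and (7), we arrive at (4). ■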
Note that for a small sample mean deviation ($\epsilon < \zeta u_0$), the bound has a similar form
to the classical Chernoff-Hoeffding bound for finite-support distributions. Although the bound for
large sample mean deviations has a different form (linear in the deviation $\epsilon$ rather than
quadratic in the exponent), it preserves the exponential decay rate in terms of both the sample
size and the deviation $\epsilon$. These properties of light-tailed reward distributions lead to
the following extension of Auer et al.'s index (UCB1) policy while preserving the logarithmic
regret order.
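
The two regimes can be seen in a minimal sketch of the bound, assuming it takes the form $\exp(-t\epsilon^2/(2\zeta))$ for $\epsilon < \zeta u_0$ and $\exp(-tu_0\epsilon/2)$ otherwise, which is how Lemma 1 is read here. The function name and the parameter values are illustrative.

```python
import math


def light_tail_bound(eps, t, zeta, u0):
    """Upper bound on Pr{sample mean - true mean >= eps} after t samples."""
    if eps < zeta * u0:
        # Small deviations: Gaussian-like decay, quadratic in eps.
        return math.exp(-t * eps**2 / (2.0 * zeta))
    # Large deviations: decay that is only linear in eps.
    return math.exp(-t * u0 * eps / 2.0)


zeta, u0, t = 2.0, 0.5, 50             # illustrative parameters
small = light_tail_bound(0.5, t, zeta, u0)    # eps < zeta * u0 = 1
large = light_tail_bound(2.0, t, zeta, u0)    # eps >= zeta * u0
```

The two branches agree at $\epsilon = \zeta u_0$, so the bound is continuous in the deviation.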

B. Extended UCBl 

As mentioned in Sec. II-B, the second term in Auer et al.'s index (1) is used for specifying
the upper confidence bound to ensure sufficient but bounded explorations on each arm, given that
the Chernoff-Hoeffding bound holds. To adapt to the Bernstein-type bound given in Lemma 1,
we consider two upper confidence bounds and determine which one to use at each time based
on their values. The detailed algorithm is shown in Fig. 1. Note that for sub-Gaussian reward
distributions (i.e., $u_0 = \infty$), the extended UCB1 is reduced to the case in which only one
upper confidence bound is used as the index function of each arm. This upper confidence bound, as
given in (8), has the same form as (1) except for a difference in choosing the parameter $a_1$.
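
The index selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the two confidence terms are $\sqrt{a_1\log t/\tau}$ and $a_2\log t/\tau$, with the first used whenever it is below $\zeta u_0$, as in Fig. 1. The names `ucb1_lt_index`, `run_ucb1_lt` and `draw_reward` are hypothetical.

```python
import math


def ucb1_lt_index(sample_mean, num_plays, t, a1, a2, zeta, u0):
    """Index of one arm: sample mean plus one of two confidence terms."""
    root_term = math.sqrt(a1 * math.log(t) / num_plays)
    if root_term < zeta * u0:
        return sample_mean + root_term                      # square-root form
    return sample_mean + a2 * math.log(t) / num_plays       # linear form


def run_ucb1_lt(draw_reward, num_arms, horizon, zeta, u0):
    """Play `horizon` rounds; draw_reward(arm) returns one reward sample.

    Returns the number of plays on each arm. The parameters are set to the
    smallest values allowed by the policy description: a1 = 8 * zeta and
    a2 = a1 / (zeta * u0).
    """
    a1 = 8.0 * zeta
    a2 = a1 / (zeta * u0)
    plays = [0] * num_arms
    means = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1                                     # play each arm once
        else:
            arm = max(
                range(num_arms),
                key=lambda n: ucb1_lt_index(
                    means[n], plays[n], t, a1, a2, zeta, u0
                ),
            )
        reward = draw_reward(arm)
        plays[arm] += 1
        means[arm] += (reward - means[arm]) / plays[arm]    # running average
    return plays
```

For instance, with `import random`, a call such as `run_ucb1_lt(lambda arm: [0.2, 0.8][arm] + random.expovariate(1.0) - random.expovariate(1.0), 2, 5000, zeta=8.3, u0=0.5)` would simulate two arms with unit-scale Laplace noise. The values of `zeta` and `u0` are assumptions that must match the actual reward distributions.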

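Theorem 1: For all light-tailed arm reward distributions, the regret of the extended UCB1
policy for any $T > 1$ is bounded by

$$ R_T^{\pi^*}(\mathcal{F}) \le \sum_{n:\,\theta_n < \theta_{\sigma(1)}} (\theta_{\sigma(1)} - \theta_n)\left(\max\left\{\frac{4a_1}{(\theta_{\sigma(1)} - \theta_n)^2},\ \frac{2a_2}{\theta_{\sigma(1)} - \theta_n}\right\}\log T + 1 + \frac{\pi^2}{3}\right). \quad (10) $$

Proof: Define

$$ c(t, s) = \begin{cases} \sqrt{\frac{a_1\log t}{s}}, & \sqrt{\frac{a_1\log t}{s}} < \zeta u_0 \\ \frac{a_2\log t}{s}, & \text{otherwise.} \end{cases} $$

Following a similar procedure as in [4], for any integer $L > 0$ and any $n$ such that
$\theta_n < \theta_{\sigma(1)}$, we have

$$
\begin{aligned}
\mathbb{E}[\tau_n(T)] &\le L + \sum_{t=1}^{T} \Pr\{\bar{\theta}_n(\tau_n(t)) + c(t, \tau_n(t)) \ge \bar{\theta}_{\sigma(1)}(\tau_{\sigma(1)}(t)) + c(t, \tau_{\sigma(1)}(t)) \ \&\ \tau_n(t) \ge L\} \\
&\le L + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{k=L}^{t-1} \Pr\{\bar{\theta}_n(k) + c(t, k) \ge \bar{\theta}_{\sigma(1)}(s) + c(t, s)\} \\
&\le L + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{k=L}^{t-1} \big(\Pr\{\bar{\theta}_{\sigma(1)}(s) \le \theta_{\sigma(1)} - c(t, s)\} + \Pr\{\bar{\theta}_n(k) \ge \theta_n + c(t, k)\} \\
&\qquad\qquad + \Pr\{\theta_n + 2c(t, k) > \theta_{\sigma(1)}\}\big). \quad (11)
\end{aligned}
$$

Choose

$$ L_0 = \left\lceil \max\left\{\frac{4a_1\log T}{(\theta_{\sigma(1)} - \theta_n)^2},\ \frac{2a_2\log T}{\theta_{\sigma(1)} - \theta_n}\right\} \right\rceil. $$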
The UCB1-LT Policy $\pi^*$

Notations and Inputs: Let $\tau_n(t)$ denote the number of plays on arm
$n$ up to (but excluding) time $t$ and $\bar{\theta}_n(\tau_n(t))$ the sample mean of
arm $n$ at time $t$. Choose $a_1 \ge 8\zeta$ and $a_2 \ge a_1/(\zeta u_0)$. Define two
index functions

$$ I_n^{(1)}(t) = \bar{\theta}_n(\tau_n(t)) + \sqrt{\frac{a_1\log t}{\tau_n(t)}}, \quad (8) $$

$$ I_n^{(2)}(t) = \bar{\theta}_n(\tau_n(t)) + \frac{a_2\log t}{\tau_n(t)}. \quad (9) $$

Initialization: In the first $N$ steps, play all arms once to obtain
the initial sample means.
At time $t > N$,

1. for each arm $n$, if $\sqrt{a_1\log t/\tau_n(t)} < \zeta u_0$, compute its index
according to $I_n^{(1)}(t)$; otherwise compute its index according to $I_n^{(2)}(t)$.

2. play the arm with the largest index.

Fig. 1. The extended UCB1 for light-tailed reward distributions.



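For any $k \ge L_0$ and $t \le T$, we have

$$
\begin{aligned}
c(t, k) &\le \max\left\{\sqrt{\frac{a_1\log t}{k}},\ \frac{a_2\log t}{k}\right\} \\
&\le \max\left\{\sqrt{\frac{a_1\log t}{L_0}},\ \frac{a_2\log t}{L_0}\right\} \\
&\le \max\left\{\sqrt{a_1\log t \cdot \frac{(\theta_{\sigma(1)} - \theta_n)^2}{4a_1\log T}},\ a_2\log t \cdot \frac{\theta_{\sigma(1)} - \theta_n}{2a_2\log T}\right\} \\
&\le \frac{\theta_{\sigma(1)} - \theta_n}{2}. \quad (12)
\end{aligned}
$$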

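From (11) and (12), we have

$$ \mathbb{E}[\tau_n(T)] \le L_0 + \sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{k=L_0}^{t-1} \big(\Pr\{\bar{\theta}_n(k) \ge \theta_n + c(t, k)\} + \Pr\{\bar{\theta}_{\sigma(1)}(s) \le \theta_{\sigma(1)} - c(t, s)\}\big). \quad (13) $$

Next, we bound the probabilities in (13) by the Bernstein-type bound (4). If
$\sqrt{a_1\log t/k} < \zeta u_0$, then

$$
\begin{aligned}
\Pr\{\bar{\theta}_n(k) \ge \theta_n + c(t, k)\} &= \Pr\left\{\bar{\theta}_n(k) \ge \theta_n + \sqrt{\frac{a_1\log t}{k}}\right\} \\
&\le \exp\left(-\frac{k}{2\zeta}\left(\sqrt{\frac{a_1\log t}{k}}\right)^2\right) \\
&\le t^{-4};
\end{aligned}
$$

otherwise, $c(t, k) = a_2\log t/k \ge a_1\log t/(\zeta u_0 k) \ge \zeta u_0$ and we have

$$
\begin{aligned}
\Pr\{\bar{\theta}_n(k) \ge \theta_n + c(t, k)\} &= \Pr\left\{\bar{\theta}_n(k) \ge \theta_n + \frac{a_2\log t}{k}\right\} \\
&\le \exp\left(-\frac{ku_0}{2} \cdot \frac{a_2\log t}{k}\right) \\
&\le \exp\left(-\frac{ku_0}{2} \cdot \frac{a_1\log t}{\zeta u_0 k}\right) \\
&\le t^{-4}.
\end{aligned}
$$

The same bound also applies to $\Pr\{\bar{\theta}_{\sigma(1)}(s) \le \theta_{\sigma(1)} - c(t, s)\}$.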

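We thus have

$$
\begin{aligned}
\mathbb{E}[\tau_n(T)] &\le L_0 + 2\sum_{t=1}^{\infty}\sum_{s=1}^{t-1}\sum_{k=L_0}^{t-1} t^{-4} \\
&\le \max\left\{\frac{4a_1\log T}{(\theta_{\sigma(1)} - \theta_n)^2},\ \frac{2a_2\log T}{\theta_{\sigma(1)} - \theta_n}\right\} + 1 + \frac{\pi^2}{3}, \quad (14)
\end{aligned}
$$

as desired, since the regret equals the sum over all arms $n$ with $\theta_n < \theta_{\sigma(1)}$
of $(\theta_{\sigma(1)} - \theta_n)\,\mathbb{E}[\tau_n(T)]$.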
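
To get a feel for the size of the bound, the sketch below evaluates the right-hand side of the regret bound of Theorem 1, read here as the sum over suboptimal arms of the gap $\Delta_n = \theta_{\sigma(1)} - \theta_n$ times $\max\{4a_1/\Delta_n^2,\ 2a_2/\Delta_n\}\log T + 1 + \pi^2/3$. The symbol $\Delta_n$, the function name and the example parameters are illustrative assumptions.

```python
import math


def regret_upper_bound(means, horizon, a1, a2):
    """Evaluate the logarithmic regret bound for given arm means."""
    best_mean = max(means)
    total = 0.0
    for mean in means:
        gap = best_mean - mean
        if gap <= 0:
            continue                    # optimal arms contribute no regret
        # Bound on the expected number of plays on this suboptimal arm.
        plays_bound = (
            max(4.0 * a1 / gap**2, 2.0 * a2 / gap) * math.log(horizon)
            + 1.0
            + math.pi**2 / 3.0
        )
        total += gap * plays_bound
    return total


# Illustrative parameters: zeta = 1 and u0 = 1 give a1 = 8 and a2 = 8.
bound_small_horizon = regret_upper_bound([1.0, 0.0], 10**3, 8.0, 8.0)
bound_large_horizon = regret_upper_bound([1.0, 0.0], 10**6, 8.0, 8.0)
```

Because the horizon enters only through $\log T$, multiplying the horizon by a thousand roughly doubles the leading term in this example.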
IV. Conclusion

In this paper, we have considered a broader class of reward distributions for MAB problems.
Auer et al.'s UCB1 policy was extended to achieve a uniform logarithmic regret bound over
time for all light-tailed reward distributions.

References 

[1] H. Robbins, "Some Aspects of the Sequential Design of Experiments," Bull. Amer. Math. Soc.,
vol. 58, no. 5, pp. 527-535, 1952.

[2] T. Lai and H. Robbins, "Asymptotically Efficient Adaptive Allocation Rules," Advances in
Applied Mathematics, vol. 6, no. 1, pp. 4-22, 1985.

[3] R. Agrawal, "Sample Mean Based Index Policies with O(log n) Regret for the Multi-armed
Bandit Problem," Advances in Applied Probability, vol. 27, pp. 1054-1078, 1995.

[4] P. Auer, N. Cesa-Bianchi, P. Fischer, "Finite-time Analysis of the Multiarmed Bandit
Problem," Machine Learning, vol. 47, pp. 235-256, 2002.

[5] K. Liu and Q. Zhao, "Multi-Armed Bandit Problems with Heavy-Tailed Reward Distributions,"
in Proc. of Allerton Conference on Communications, Control, and Computing, September, 2011.
Available at http://arxiv.org/abs/1106.6104

[6] H. Liu, K. Liu, and Q. Zhao, "Learning in A Changing World: Restless Multi-Armed Bandit
with Unknown Dynamics," submitted to IEEE Transactions on Information Theory, November, 2011.
Available at http://arxiv.org/abs/1011.4969

[7] Y. Gai and B. Krishnamachari, "Decentralized Online Learning Algorithms for Opportunistic
Spectrum Access," Technical Report, March, 2011. Available at
http://anrg.usc.edu/www/publications/papers/DMAB2011.pdf

[8] K. Liu and Q. Zhao, "Decentralized Multi-Armed Bandit with Distributed Multiple Players,"
IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667-5681, November, 2010.

[9] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, "Distributed Algorithms for Learning
and Cognitive Medium Access with Logarithmic Regret," IEEE JSAC on Advances in Cognitive Radio
Networking and Communications, vol. 29, no. 4, pp. 731-745, Apr. 2011.

[10] C. Tekin, M. Liu, "Online Algorithms for the Multi-Armed Bandit Problem With Markovian
Rewards," Proc. of Allerton Conference on Communications, Control, and Computing, Sep., 2010.

[11] C. Tekin, M. Liu, "Online Learning in Opportunistic Spectrum Access: A Restless Bandit
Approach," INFOCOM, April, 2011.

[12] W. Hoeffding, "Probability Inequalities for Sums of Bounded Random Variables," Journal of
the American Statistical Association, vol. 58, no. 301, pp. 13-30, March, 1963.

[13] P. Chareka, O. Chareka, S. Kennedy, "Locally Sub-Gaussian Random Variable and the Strong
Law of Large Numbers," Atlantic Electronic Journal of Mathematics, vol. 1, no. 1, pp. 75-81, 2006.

[14] R. Vershynin, "Introduction to the Non-Asymptotic Analysis of Random Matrices," available
at http://arxiv.org/abs/1011.3027v6



